Testing for equality between two transformations of random variables

Mohamed BOUTAHAR and Denys POMMERET
November 1, 2011



Abstract

Consider two random variables contaminated by two unknown transformations.
The aim of this paper is to test the equality of those transformations. Two cases are
distinguished: in the first, the two random variables have known distributions; in the
second, their distributions are unknown but the variables are observed before
contamination. We propose a nonparametric test statistic based on empirical cumulative
distribution functions. Monte Carlo studies are performed to analyze the level and the
power of the test. An illustration is presented through a real data set.

1 Introduction

There is an extensive literature on the deconvolution problem, in which an unknown
signal Y is contaminated by a noise Z, leading to the observed signal

$$X = Y + Z. \qquad (1)$$

A major problem is to reconstruct the density of Y. Many authors studied the univariate
problem when the noise Z has a known distribution (see for instance Fan [10], Carroll and
Hall [3], Devroye [7], or more recently Holzmann et al. [12] for a review). Bissantz et
al. [1] proposed the construction of confidence bands for the density of Y based on i.i.d.
observations from (1). The case where both Y and Z have unknown distributions is
considered in Neumann [15], Diggle and Hall [8] and Johannes et al. [13], among others. When
the error density and the distribution of Y have different characteristics the model can be
identified, as shown in Butucea and Matias [2] and Meister [14]. But without information
on Z, the model suffers from a lack of identifiability. One solution is to assume that another
independent sample is observed from the measurement error Z (as done in Efromovich and
Koltchinskii [9] and Cavalier and Hengartner [5]).

A more general model than (1) occurs when the contaminated random variables are
observed through a transformation; that is, there exists g such that

$$X = g(Y + Z). \qquad (2)$$

When g is known the problem is to estimate the distribution of Y from a sample drawn from
(2). An application of this model to fluorescence lifetime measurements is given in Comte
and Rebafka [6]. The authors developed an adaptive estimator that takes into account
both the perturbation from the unknown additive noise and the distortion due to the
nonlinear transformation.

In this paper we consider a two-sample contamination problem that can be related
to models (1) and (2) as follows. We assume that two contaminated random variables are
observed, say $X$ and $\tilde X$, which are transformations of two known, or observed,
signals $Y$ and $\tilde Y$; that is,

$$X = g(Y), \qquad \tilde X = \tilde g(\tilde Y), \qquad (3)$$

where $g$ and $\tilde g$ are continuous monotone unknown functions. Our purpose is to test

$$H_0 : g = \tilde g \quad \text{against} \quad H_1 : g \neq \tilde g, \qquad (4)$$

based on two i.i.d. samples satisfying (3). The problem of testing (4) is of interest in many
applications where a signal is perturbed in some way other than through the additive noise
model (1). We distinguish two important cases:

Case 1 The distributions of $Y$ and $\tilde Y$ are known and we observe two samples from $X$
and $\tilde X$. This situation may be encountered when two signals are controlled at the
entry of a system but observed with perturbations at its exit.

Case 2 The distributions of $Y$ and $\tilde Y$ are unknown; we first observe two independent
samples from $Y$ and $\tilde Y$, and then we observe contaminated samples from $X$ and
$\tilde X$ satisfying (3). This situation may be encountered when two unknown signals are
observed both at the entry and at the exit of a system.

For both cases we construct a test statistic based on nonparametric empirical estimators
of $g$ and $\tilde g$, and we adapt a limit result on empirical processes due to Sen [16]. Our
test statistics are easily implemented, and simulations show that they have good power
against various alternatives. When $H_0$ is not rejected, that is, when the two noise
functions are identical, it is of interest to interpret the common estimate of $g$. We
illustrate this point with a study of the Framingham dataset (see Carroll et al. [4] and,
more recently, Wang and Wang [17]).

The paper is organized as follows: in Section 2.1 we consider the problem when the two
original signals have known distributions. In Section 2.2 we relax this assumption: the
distributions are unknown, but the two original signals are observed both before and after
perturbation. In Section 3 a simulation study is presented and a real data set is analyzed.

2 The test statistic



2.1 Case 1: the distributions of $Y$ and $\tilde Y$ are known

We consider $n$ (resp. $\tilde n$) i.i.d. observations $X_1, \dots, X_n$ (resp.
$\tilde X_1, \dots, \tilde X_{\tilde n}$) from (3). We assume that $Y$ and $\tilde Y$ are
independent. Write $F_Y$ and $F_{\tilde Y}$ for the cumulative distribution functions of $Y$
and $\tilde Y$ respectively. We assume that these functions are known and invertible. We also
write $F_X$ and $F_{\tilde X}$ for the cumulative distribution functions of $X$ and $\tilde X$.
Finally, we assume that the transformations $g$ and $\tilde g$ are monotone and, without loss
of generality, increasing. Note that $g(y) = F_X^{-1}(F_Y(y))$ and
$\tilde g(y) = F_{\tilde X}^{-1}(F_{\tilde Y}(y))$. Hence natural nonparametric estimators of
the contaminating functions are given by

$$\hat g(\cdot) = X_{([n F_Y(\cdot)]+1)} \quad \text{and} \quad \hat{\tilde g}(\cdot) = \tilde X_{([\tilde n F_{\tilde Y}(\cdot)]+1)}, \qquad (5)$$

where $X_{(i)}$ and $\tilde X_{(i)}$ denote the $i$th order statistics, and $[x]$ denotes the
integer part of the real $x$. A fundamental theorem of Sen [16] states the following
convergence in distribution: for $0 < p < 1$,

$$\sqrt{n}\,\big(X_{([np]+1)} - F_X^{-1}(p)\big) \xrightarrow{d} \mathcal N\Big(0, \frac{p(1-p)}{f^2(F_X^{-1}(p))}\Big), \qquad (6)$$

where $\xrightarrow{d}$ denotes convergence in distribution, $f$ denotes the density of $X$ and
$\mathcal N(m, \sigma^2)$ the Normal distribution with mean $m$ and variance $\sigma^2$. We
will need the following two standard assumptions:

• (A1) there exists $a < \infty$ such that $n/(n + \tilde n) \to a$;

• (A2) $f > 0$ and $f$ is $C^k$, for some positive integer $k$.

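We deduce a first result, which is the main tool for the construction of the test statistic.

Proposition 2.1 Let Assumptions (A1)-(A2) hold. Under $H_0$ we have

$$\sqrt{\frac{n \tilde n}{n + \tilde n}}\,\big(\hat g(y) - \hat{\tilde g}(y)\big) \xrightarrow{d} \mathcal N(0, \sigma^2(y)), \quad \text{as } n \to \infty,\ \tilde n \to \infty, \qquad (7)$$

where

$$\sigma^2(y) = (1 - a)\,\frac{F_Y(y)(1 - F_Y(y))}{f_X^2(g(y))} + a\,\frac{F_{\tilde Y}(y)(1 - F_{\tilde Y}(y))}{f_{\tilde X}^2(g(y))}.$$

Proof. It follows directly from (6), replacing $p$ by $F_Y(y)$ and $F_{\tilde Y}(y)$ respectively.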
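The order-statistic estimator (5) and the limit (7) can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes $Y$ and $\tilde Y$ are standard normal and uses the illustrative transformation $g(y) = 2y + 1$ (chosen here only because $f_X(g(y)) = \varphi(y)/2$ is then available in closed form). The function names are hypothetical.

```python
import math
import numpy as np

def std_normal_cdf(y):
    """CDF of N(0, 1); plays the role of the known F_Y in this sketch."""
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))

def g_hat(x_sample, p):
    """Order-statistic estimator (5): X_([n p] + 1), with p = F_Y(y)."""
    x_sorted = np.sort(x_sample)
    n = len(x_sorted)
    k = int(math.floor(n * p)) + 1      # 1-based rank [n p] + 1
    k = min(max(k, 1), n)               # clip so the rank stays in 1..n
    return x_sorted[k - 1]

def scaled_difference(n, n_tilde, y, g, rng):
    """One draw of sqrt(n n~/(n + n~)) * (g_hat(y) - g~_hat(y)) under H0."""
    x = g(rng.standard_normal(n))
    x_tilde = g(rng.standard_normal(n_tilde))
    p = std_normal_cdf(y)
    scale = math.sqrt(n * n_tilde / (n + n_tilde))
    return scale * (g_hat(x, p) - g_hat(x_tilde, p))

# Illustrative choice (not from the paper): g(y) = 2y + 1.
g = lambda y: 2.0 * y + 1.0
rng = np.random.default_rng(0)
y0, n, n_tilde = 0.0, 400, 400
draws = np.array([scaled_difference(n, n_tilde, y0, g, rng) for _ in range(2000)])

# Under H0 with identical input laws both variance terms coincide, so
# sigma^2(y) = F_Y(y)(1 - F_Y(y)) / f_X(g(y))^2 with f_X(g(y)) = phi(y) / 2.
p0 = std_normal_cdf(y0)
phi0 = math.exp(-0.5 * y0**2) / math.sqrt(2.0 * math.pi)
sigma2 = p0 * (1.0 - p0) / (phi0 / 2.0) ** 2
print("empirical variance:", draws.var(), " theoretical sigma^2:", sigma2)
```

The empirical variance of the scaled difference should be close to $\sigma^2(y)$, up to Monte Carlo and finite-sample error.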

We will estimate the variance $\sigma^2(y)$ by a nonparametric method. Consider a kernel
$K(\cdot)$, for instance the quartic kernel defined by
$K(y) = \frac{15}{16}(1 - y^2)^2 \mathbf 1_{(-1,1)}(y)$, and an associated bandwidth $h_n$. In
the sequel, we set $K_{h_n}(y) = K(y/h_n)$. To avoid small values in the denominators when
estimating the variance we use

$$\hat f_X(y) = \max\Big(\frac{1}{n h_n} \sum_{i=1}^{n} K_{h_n}(X_i - y),\ e_n\Big),$$

where $e_n > 0$ and $e_n \to 0$ when $n$ tends to infinity; $\hat f_{\tilde X}$ is defined in
the same way from the $\tilde X_i$. The estimator of $\sigma^2$ is then

$$\hat\sigma^2(y) = (1 - a)\,\frac{F_Y(y)(1 - F_Y(y))}{\hat f_X^2(\hat g(y))} + a\,\frac{F_{\tilde Y}(y)(1 - F_{\tilde Y}(y))}{\hat f_{\tilde X}^2(\hat{\tilde g}(y))},$$

and we consider the statistic

$$T_1(y) = \frac{n \tilde n}{n + \tilde n}\,\hat\sigma^{-2}(y)\,\big(\hat g(y) - \hat{\tilde g}(y)\big)^2. \qquad (8)$$
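As an illustration of how the statistic (8) might be computed, here is a hedged sketch. It assumes that the values $p = F_Y(y)$ and $\tilde p = F_{\tilde Y}(y)$ are supplied by the user (Case 1), replaces the limit $a$ by the plug-in $n/(n+\tilde n)$, and uses arbitrary illustrative exponents for $h_n$ and $e_n$; none of these choices is taken from the paper, and the function names are hypothetical.

```python
import math
import numpy as np

def quartic_kernel(u):
    """Quartic kernel K(u) = 15/16 (1 - u^2)^2 on (-1, 1)."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) < 1.0, 15.0 / 16.0 * (1.0 - u**2) ** 2, 0.0)

def trimmed_kde(sample, point, h, e):
    """Kernel density estimate at `point`, floored at the trimming value e."""
    dens = quartic_kernel((sample - point) / h).sum() / (len(sample) * h)
    return max(dens, e)

def order_stat_estimator(sample, p):
    """Estimator (5): the order statistic of rank [n p] + 1 (clipped to 1..n)."""
    s = np.sort(sample)
    k = min(max(int(math.floor(len(s) * p)) + 1, 1), len(s))
    return s[k - 1]

def t1_statistic(x, x_tilde, p, p_tilde, h, e):
    """Statistic (8); p = F_Y(y) and p_tilde = F_Ytilde(y) are assumed known."""
    n, n_t = len(x), len(x_tilde)
    a = n / (n + n_t)                      # plug-in for the limit a (assumption)
    g = order_stat_estimator(x, p)
    g_t = order_stat_estimator(x_tilde, p_tilde)
    f = trimmed_kde(x, g, h, e)
    f_t = trimmed_kde(x_tilde, g_t, h, e)
    sigma2 = (1 - a) * p * (1 - p) / f**2 + a * p_tilde * (1 - p_tilde) / f_t**2
    return n * n_t / (n + n_t) * (g - g_t) ** 2 / sigma2

def chi2_1_pvalue(t):
    """Upper tail probability of the chi-squared law with one degree of freedom."""
    return math.erfc(math.sqrt(t / 2.0))

# Usage on simulated data: identity transformation versus a unit shift, at y = 0.
rng = np.random.default_rng(1)
n = 200
h, e = n ** (-1 / 5), n ** (-1 / 2)        # illustrative exponents only
x = rng.standard_normal(n)                  # g(y) = y
x_shift = rng.standard_normal(n) + 1.0      # g~(y) = y + 1
t_alt = t1_statistic(x, x_shift, 0.5, 0.5, h, e)
print("T1 =", t_alt, " p-value =", chi2_1_pvalue(t_alt))
```

The p-value uses the identity $P(\chi^2_1 > t) = \mathrm{erfc}(\sqrt{t/2})$, so no statistics library is needed.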

Proposition 2.2 Let Assumptions (A1)-(A2) hold. If $h_n \sim n^{-c_1}$ and $e_n \sim n^{-c_2}$
for some positive constants $c_1$ and $c_2$ such that $\ldots < c_1 < \ldots$, then under
$H_0$, when $n \to \infty$, $\tilde n \to \infty$, we have for all $y$:

$$T_1(y) \xrightarrow{d} Z,$$

where $Z$ is chi-squared distributed with one degree of freedom.

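Proof. We need the following fundamental lemma (see Härdle [11]):

Lemma 2.1

$$\sup_y |\hat f(y) - f(y)|^2 = O_P\Big(h_n^{2k} + \frac{\log n}{n h_n}\Big).$$

We can write

$$\hat\sigma^2(y) = \frac{u(y)}{\hat f_X^2(\hat g(y))} + \frac{v(y)}{\hat f_{\tilde X}^2(\hat{\tilde g}(y))},$$

where $u(y) = (1 - a)F_Y(y)(1 - F_Y(y))$ and $v(y) = aF_{\tilde Y}(y)(1 - F_{\tilde Y}(y))$.
Using a Taylor expansion, there exist $A$ and $B$ such that

$$\hat\sigma^2(y) = \sigma^2(y) + \big[\hat f_X^2(\hat g(y)) - f_X^2(g(y))\big]\Big(-\frac{u(y)}{A^2}\Big) + \big[\hat f_{\tilde X}^2(\hat{\tilde g}(y)) - f_{\tilde X}^2(g(y))\big]\Big(-\frac{v(y)}{B^2}\Big),$$

with $A$ lying between $f_X^2(g(y))$ and $\hat f_X^2(\hat g(y))$, and $B$ lying between
$f_{\tilde X}^2(g(y))$ and $\hat f_{\tilde X}^2(\hat{\tilde g}(y))$. Then, from Lemma 2.1 we get

$$\hat\sigma^2(y) = \sigma^2(y) + o_P(1),$$

by assumption, and the result follows from Proposition 2.1.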
2.2 Case 2: the distributions of $Y$ and $\tilde Y$ are unknown

We consider $n_x$ (resp. $\tilde n_x$) i.i.d. observations $X_1, \dots, X_{n_x}$ (resp.
$\tilde X_1, \dots, \tilde X_{\tilde n_x}$) and $n_y$ (resp. $\tilde n_y$) i.i.d. observations
$Y_1, \dots, Y_{n_y}$ (resp. $\tilde Y_1, \dots, \tilde Y_{\tilde n_y}$) from (3). Put

$$N = n_x n_y/(n_x + n_y) \quad \text{and} \quad \tilde N = \tilde n_x \tilde n_y/(\tilde n_x + \tilde n_y).$$

The two samples $Y_1, \dots, Y_{n_y}$ and $\tilde Y_1, \dots, \tilde Y_{\tilde n_y}$ can be viewed
as two independent training sets which allow us to estimate the initial distributions of the
signals before perturbation. Again we want to test $H_0 : g = \tilde g$. We now estimate $g$
and $\tilde g$ by



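$$\hat g(\cdot) = X_{([n_x \hat F_Y(\cdot)])}, \qquad \hat{\tilde g}(\cdot) = \tilde X_{([\tilde n_x \hat F_{\tilde Y}(\cdot)])}, \qquad (9)$$

where

$$\hat F_Y(y) = \frac{1}{n_y} \sum_{i=1}^{n_y} \mathbf 1(Y_i \le y), \qquad \hat F_{\tilde Y}(y) = \frac{1}{\tilde n_y} \sum_{i=1}^{\tilde n_y} \mathbf 1(\tilde Y_i \le y) \qquad (10)$$

are the empirical distribution functions of $Y$ and $\tilde Y$ respectively. We assume that

$$\lim n_x/(n_x + n_y) = a < \infty, \qquad \lim \tilde n_x/(\tilde n_x + \tilde n_y) = \tilde a < \infty,$$

and we make the following assumption, extending Assumption (A1):

• (A3) there exists $b < \infty$ such that $N/(N + \tilde N) \to b$.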
We can extend Proposition 2.1 as follows: 

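Proposition 2.3 Let Assumptions (A2)-(A3) hold. Under $H_0$ we have

$$\sqrt{\frac{N \tilde N}{N + \tilde N}}\,\big(\hat g(y) - \hat{\tilde g}(y)\big) \xrightarrow{d} \mathcal N(0, \sigma^2(y)), \quad \text{as } N \to \infty,\ \tilde N \to \infty, \qquad (11)$$

where

$$\sigma^2(y) = (1 - b)\,\frac{F_Y(y)(1 - F_Y(y))}{f_X^2(g(y))} + b\,\frac{F_{\tilde Y}(y)(1 - F_{\tilde Y}(y))}{f_{\tilde X}^2(g(y))}.$$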
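Proof. We first show that

$$U = \sqrt{\frac{n_x n_y}{n_x + n_y}}\,\big(\hat g(y) - g(y)\big) \xrightarrow{d} \mathcal N(0, \sigma_1^2(y)), \quad \text{as } n_x \to \infty,\ n_y \to \infty,$$

where

$$\sigma_1^2(y) = \frac{F_Y(y)(1 - F_Y(y))}{f_X^2(g(y))}. \qquad (12)$$

For that, write

$$\hat g(y) - g(y) = G(y) + \tilde G(y),$$

where $G(y) = \hat g(y) - F_X^{-1}(\hat F_Y(y)) = X_{([n_x \hat F_Y(y)])} - F_X^{-1}(\hat F_Y(y))$
and $\tilde G(y) = F_X^{-1}(\hat F_Y(y)) - g(y)$. By the delta method we get

$$n_y^{1/2}\,\tilde G(y) \xrightarrow{d} \mathcal N(0, \sigma_1^2(y)).$$

Then we decompose the characteristic function

$$E(\exp(iuU)) = E\Big(e^{iu\alpha \sqrt{n_y}\,\tilde G(y)}\, E\big(e^{iu\sqrt{1-\alpha^2}\,\sqrt{n_x}\,G(y)} \mid \mathbf Y\big)\Big),$$

where $\alpha = \sqrt{n_x/(n_x + n_y)}$ and $\mathbf Y$ stands for the vector of observations
$Y_1, \dots, Y_{n_y}$. Since these functions are bounded we get:

$$\lim E(\exp(iuU)) = E\Big(\lim e^{iu\alpha \sqrt{n_y}\,\tilde G(y)}\, \lim E\big(e^{iu\sqrt{1-\alpha^2}\,\sqrt{n_x}\,G(y)} \mid \mathbf Y\big)\Big) = E\big(e^{iu\sqrt{a} Z}\big)\, \lim e^{-\frac{1}{2} u^2 (1-a) \hat\sigma_1^2(y)},$$

where $Z \sim \mathcal N(0, \sigma_1^2(y))$ and
$\hat\sigma_1^2(y) = \hat F_Y(y)(1 - \hat F_Y(y)) / f_X^2(F_X^{-1}(\hat F_Y(y)))$. We finally obtain

$$\lim E(\exp(iuU)) = \exp\big(-\tfrac{1}{2} u^2 \sigma_1^2(y)\big).$$

Similarly, writing

$$\tilde U = \sqrt{\frac{\tilde n_x \tilde n_y}{\tilde n_x + \tilde n_y}}\,\big(\hat{\tilde g}(y) - \tilde g(y)\big),$$

we obtain that

$$\tilde U \xrightarrow{d} \mathcal N(0, \tilde\sigma_1^2(y)), \quad \text{as } \tilde n_x \to \infty,\ \tilde n_y \to \infty,$$

with

$$\tilde\sigma_1^2(y) = \frac{F_{\tilde Y}(y)(1 - F_{\tilde Y}(y))}{f_{\tilde X}^2(\tilde g(y))}.$$

Finally, combining these two convergences with the equality $g = \tilde g$ under $H_0$ we
complete the proof.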

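As previously, we can estimate the variance $\sigma^2(y)$ of Proposition 2.3 by the
nonparametric estimator

$$\hat\sigma^2(y) = (1 - b)\,\frac{\hat F_Y(y)(1 - \hat F_Y(y))}{\hat f_X^2(\hat g(y))} + b\,\frac{\hat F_{\tilde Y}(y)(1 - \hat F_{\tilde Y}(y))}{\hat f_{\tilde X}^2(\hat{\tilde g}(y))},$$

where $\hat F_Y$ and $\hat F_{\tilde Y}$ are the empirical distribution functions of $Y$ and
$\tilde Y$ given by (10). Our test statistic is given by

$$T_2(y) = \frac{N \tilde N}{N + \tilde N}\,\hat\sigma^{-2}(y)\,\big(\hat g(y) - \hat{\tilde g}(y)\big)^2. \qquad (13)$$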
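The Case 2 statistic differs from the Case 1 statistic only in that the distribution functions of the input signals are replaced by empirical ones. A minimal sketch follows; it is an illustration rather than the authors' implementation. The limit $b$ is replaced by the plug-in $N/(N+\tilde N)$, ranks are clipped to stay inside the sample, and the bandwidth and trimming values in the usage lines are arbitrary. All function names are hypothetical.

```python
import math
import numpy as np

def ecdf(sample, y):
    """Empirical distribution function (10) evaluated at y."""
    return float(np.mean(np.asarray(sample) <= y))

def order_stat(sample, rank):
    """Order statistic of 1-based `rank`, clipped to the range 1..len(sample)."""
    s = np.sort(sample)
    return s[min(max(rank, 1), len(s)) - 1]

def quartic_kde(sample, point, h, e):
    """Quartic-kernel density estimate at `point`, floored at e."""
    u = (np.asarray(sample) - point) / h
    k = np.where(np.abs(u) < 1.0, 15.0 / 16.0 * (1.0 - u**2) ** 2, 0.0)
    return max(k.sum() / (len(sample) * h), e)

def t2_statistic(x, y_train, x_t, y_t_train, y, h, e):
    """Statistic (13) at the point y, from two pairs of (X, Y) samples."""
    big_n = len(x) * len(y_train) / (len(x) + len(y_train))
    big_n_t = len(x_t) * len(y_t_train) / (len(x_t) + len(y_t_train))
    b = big_n / (big_n + big_n_t)          # plug-in for the limit b (assumption)
    p, p_t = ecdf(y_train, y), ecdf(y_t_train, y)
    # Estimator (9); the small offset guards against floating-point rounding.
    g = order_stat(x, int(math.floor(len(x) * p + 1e-9)))
    g_t = order_stat(x_t, int(math.floor(len(x_t) * p_t + 1e-9)))
    f = quartic_kde(x, g, h, e)
    f_t = quartic_kde(x_t, g_t, h, e)
    sigma2 = (1 - b) * p * (1 - p) / f**2 + b * p_t * (1 - p_t) / f_t**2
    if sigma2 == 0.0:
        return float("nan")                # y lies outside both training samples
    return big_n * big_n_t / (big_n + big_n_t) * (g - g_t) ** 2 / sigma2

# Usage on simulated data: g(y) = y versus g~(y) = y + 1, evaluated at y = 0.
rng = np.random.default_rng(2)
n = 400
y_train, y_t_train = rng.standard_normal(n), rng.standard_normal(n)
x = rng.standard_normal(n)
x_t = rng.standard_normal(n) + 1.0
t2_alt = t2_statistic(x, y_train, x_t, y_t_train, 0.0, n ** (-1 / 5), n ** (-1 / 2))
print("T2 =", t2_alt)
```

Under $H_0$ the value would be compared with a chi-squared quantile with one degree of freedom (3.84 at the 5% level).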

We can now generalize Proposition 2.2 as follows.

Proposition 2.4 Let Assumptions (A2)-(A3) hold. If $h_n \sim n^{-c_1}$ and $e_n \sim n^{-c_2}$
for some positive constants $c_1$ and $c_2$ such that $\ldots < c_1 < \ldots$, then under
$H_0$, when $N \to \infty$, $\tilde N \to \infty$, we have:

$$T_2 \xrightarrow{d} Z,$$

where $Z$ is chi-squared distributed with one degree of freedom.

Proof. We combine the proof of Proposition 2.2 with the fact that $F(1 - F)$ is
bounded to get

$$\hat\sigma^2(y) = \sigma^2(y) + o_P(1),$$

and we conclude by Proposition 2.3.



2.3 Behaviour of the tests under $H_1$

We study convergence properties of the tests $T_1$ and $T_2$ under some alternatives.

Proposition 2.5

a. General alternatives.
Consider the test statistics $T_1$ and $T_2$. Then for all $y$ such that
$g(y) \neq \tilde g(y)$, we have

$$T_1(y) \xrightarrow{P} +\infty \quad \text{and} \quad T_2(y) \xrightarrow{P} +\infty,$$

where $\xrightarrow{P}$ denotes convergence in probability.

b. Local alternatives.
Let us denote $m = n\tilde n/(n + \tilde n)$ or $m = N\tilde N/(N + \tilde N)$ according to
whether the test statistic $T_1$ or $T_2$ is used, and consider the local alternatives $H_n$
under which $\tilde g(y) - g(y)$ is of order $m^{-\beta}$. Then under $H_n$, and when
$n \to \infty$, $\tilde n \to \infty$, $N \to \infty$, $\tilde N \to \infty$, we have for all $y$:

i. If $\beta > 1/2$ then

$$T_1(y) \xrightarrow{d} Z \quad \text{and} \quad T_2(y) \xrightarrow{d} Z.$$

ii. If $\beta = 1/2$ then

$$T_1(y) \xrightarrow{d} Z_\kappa \quad \text{and} \quad T_2(y) \xrightarrow{d} Z_\kappa.$$

iii. If $\beta < 1/2$ then

$$T_1(y) \xrightarrow{P} +\infty \quad \text{and} \quad T_2(y) \xrightarrow{P} +\infty,$$

where $Z$ is chi-squared distributed with one degree of freedom and $Z_\kappa$ is noncentral
chi-squared distributed with one degree of freedom and parameter $\kappa(y)$.

The proof of this proposition is straightforward and hence is omitted. 
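To give a feeling for case ii, the limiting rejection probability of a 5% test can be computed when the statistic converges to a noncentral chi-squared variable. The sketch below assumes the usual convention that a noncentral chi-squared variable with one degree of freedom and noncentrality $\lambda$ is distributed as $(W + \sqrt{\lambda})^2$ with $W \sim \mathcal N(0,1)$; the text above does not spell out the expression of $\kappa(y)$, so $\lambda$ is treated here as a free input.

```python
import math

def normal_cdf(z):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def limiting_power(noncentrality, z_crit=1.959963984540054):
    """P((W + sqrt(noncentrality))^2 > z_crit^2) with W ~ N(0, 1).

    z_crit is the two-sided 5% normal critical value, so z_crit^2 is the
    5% critical value of the chi-squared law with one degree of freedom.
    """
    root = math.sqrt(noncentrality)
    return normal_cdf(-z_crit + root) + normal_cdf(-z_crit - root)

for lam in (0.0, 1.0, 4.0, 9.0):
    print(lam, limiting_power(lam))
```

With zero noncentrality the function returns the nominal level, which corresponds to case i; as the noncentrality grows the rejection probability tends to one, in line with case iii.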

Remark 2.1 The estimator $\hat g$ (resp. $\hat{\tilde g}$) is computed from
$(X_1, \dots, X_{n_x})$ and $(Y_1, \dots, Y_{n_y})$ (resp. $(\tilde X_1, \dots, \tilde X_{\tilde n_x})$
and $(\tilde Y_1, \dots, \tilde Y_{\tilde n_y})$). Under the null $H_0$ there are two different
ways to construct a common estimator of $g$. First, we can consider the aggregate estimator

$$\hat g_0 = \frac{(n_x + n_y)\,\hat g + (\tilde n_x + \tilde n_y)\,\hat{\tilde g}}{n_x + n_y + \tilde n_x + \tilde n_y}, \qquad (14)$$

and, second, another estimator can be constructed by aggregating the samples.
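The aggregate estimator (14) is a sample-size-weighted average of the two individual estimates. A small sketch, with a hypothetical function name, evaluated pointwise:

```python
def aggregate_estimate(g_val, g_tilde_val, n_x, n_y, n_x_tilde, n_y_tilde):
    """Weighted average (14) of the two estimates at a common point y."""
    w = n_x + n_y                      # weight of the first estimator
    w_tilde = n_x_tilde + n_y_tilde    # weight of the second estimator
    return (w * g_val + w_tilde * g_tilde_val) / (w + w_tilde)

# With equal sample sizes the aggregate is the plain average of the two values.
print(aggregate_estimate(1.0, 3.0, 10, 10, 10, 10))
```

When all four samples have the same size, as in a balanced design, the weights cancel and the aggregate reduces to the midpoint of the two estimates.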

3 Simulations and data study 

For all empirical powers and empirical levels we carry out experiments of 10000 samples and
we use three different sample sizes: n = 50, n = 100, and n = 500. For each replication we
compute the statistics $T_1(y)$ and $T_2(y)$ given by (8) and (13), where $y$ is drawn at
random from a standard normal distribution.

3.1 Study of the empirical levels 

We denote by $\mathcal N(0, 1)$ the standard normal distribution with mean zero and variance
1. We first consider the case where $Y_t$ and $\tilde Y_t$ are independent and
$\mathcal N(0, 1)$ distributed.
The bandwidth is chosen as $h_n = n^{-\ldots}$ and the trimming as $e_n = n^{-\ldots}$.

Empirical level To study the empirical levels of $T_1$ and $T_2$ we choose

$$g(y) = \tilde g(y) = \exp\{(y + 3)/(y + 5)\},$$

and we fix a theoretical level $\alpha = 5\%$. Table 1 shows the empirical levels of the tests
under $H_0$. Both statistics $T_1$ and $T_2$ provide levels close to the nominal value.
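A Monte Carlo experiment of this kind can be sketched as follows for $T_1$. This is an illustration under stated assumptions and is not expected to reproduce Table 1: the bandwidth is a rule-of-thumb value scaled by the sample spread, the trimming is $n^{-1/2}$, and the number of replications is kept small. Both tuning choices and all function names are assumptions of this sketch.

```python
import math
import numpy as np

def g(y):
    """Common transformation used in Section 3.1."""
    return np.exp((y + 3.0) / (y + 5.0))

def norm_cdf(y):
    """Known distribution function F_Y of the N(0, 1) inputs."""
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))

def kde(sample, point):
    """Trimmed quartic-kernel density estimate at `point`."""
    n = len(sample)
    h = 2.78 * sample.std() * n ** (-0.2)   # assumed rule-of-thumb bandwidth
    e = n ** (-0.5)                         # assumed trimming level
    u = (sample - point) / h
    k = np.where(np.abs(u) < 1.0, 15.0 / 16.0 * (1.0 - u**2) ** 2, 0.0)
    return max(k.sum() / (n * h), e)

def t1(x, x_t, p):
    """Statistic (8) when F_Y = F_Ytilde is known, with a = n/(n + n~)."""
    n, n_t = len(x), len(x_t)
    k = min(int(n * p) + 1, n)              # rank [n p] + 1, clipped at n
    k_t = min(int(n_t * p) + 1, n_t)
    gh, gh_t = np.sort(x)[k - 1], np.sort(x_t)[k_t - 1]
    a = n / (n + n_t)
    s2 = p * (1 - p) * ((1 - a) / kde(x, gh) ** 2 + a / kde(x_t, gh_t) ** 2)
    return n * n_t / (n + n_t) * (gh - gh_t) ** 2 / s2

def rejection_rate(g_tilde, n, reps, rng, crit=3.841458820694124):
    """Share of replications with T1(y) above the 5% chi-squared(1) quantile."""
    rejections = 0
    for _ in range(reps):
        y = rng.standard_normal()           # random evaluation point
        x = g(rng.standard_normal(n))
        x_t = g_tilde(rng.standard_normal(n))
        rejections += t1(x, x_t, norm_cdf(y)) > crit
    return rejections / reps

rng = np.random.default_rng(0)
level = rejection_rate(g, n=100, reps=1000, rng=rng)
power = rejection_rate(lambda y: g(y) + 1.0, n=100, reps=200, rng=rng)
print("empirical level:", level, " empirical power against g + 1:", power)
```

The first call estimates the level (both samples share the same transformation); the second estimates the power against a shifted transformation.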
3.2 Study of the empirical powers

We consider the model where $Y_t$ and $\tilde Y_t$ are independent and $\mathcal N(0, 1)$
distributed. To study the empirical powers of $T_1$ and $T_2$ we consider
$g(y) = \exp((y + 3)/(y + 5))$ and the four following transformations:

$$g_1(y) = \exp((y + 3)/(y + 5)) + 1, \qquad g_2(y) = 2\exp((y + 3)/(y + 5)),$$
$$g_3(y) = -(y + 11)/(y + 5), \qquad g_4(y) = \ldots + 5,$$

and we also study local alternatives by considering:

$$g_5(y) = g(y) + \frac{2(y + 5)}{m^{\beta}}.$$

Tables 2-3 present empirical powers for $T_1$ and $T_2$ under fixed and local alternatives,
respectively, for a theoretical level $\alpha$ equal to 5%. From Table 2 it appears that
knowing the distributions of $Y$ and $\tilde Y$ yields more stable statistics that detect
departures from the null hypothesis more easily: the test statistic $T_1$ provides better
power, particularly for the smallest sample size. The empirical power of $T_2$ is lower
for n = 50, but as the sample size n increases it becomes similar to that of $T_1$.
Table 3 indicates that $T_1$ and $T_2$ provide good power for $\beta \le 1/2$. For
$\beta > 1/2$ the power converges to the theoretical level $\alpha$; this is in accordance
with the theoretical result stated in Proposition 2.5.

3.3 Real example: Framingham data 

We consider the Framingham Study on coronary heart disease described by Carroll et al.
[4]. The data consist of measurements of systolic blood pressure (SBP) obtained at two
different examinations in 1,615 males on an 8-year follow-up. At each examination, the
SBP was measured twice for each individual. The four variables of interest are:

$Y$ = the first SBP at examination 1,
$\tilde Y$ = the second SBP at examination 1,
$X$ = the first SBP at examination 2,
$\tilde X$ = the second SBP at examination 2.

Our purpose is to examine whether the distribution of the SBP changed over time,
and which type of transformation it underwent. Following our notation, we study
the transformation between the distributions of $Y$ and $X$ and also the one between the
distributions of $\tilde Y$ and $\tilde X$. We thus assume that $X = g(Y)$ and
$\tilde X = \tilde g(\tilde Y)$.

Table 4 indicates that the distributions of $X$, $Y$, $\tilde X$ and $\tilde Y$ are all skewed
to the right and leptokurtic. KS is the Kolmogorov-Smirnov statistic; the associated
p-values are less than $2.2 \times 10^{-16}$, and hence the normality assumption is strongly
rejected. Figure 1 shows nonparametric estimates of the probability densities of $X$, $Y$,
$\tilde X$ and $\tilde Y$.

From Figure 1 it seems that the distributions of the variables $Y$ and $X$ have a similar
shape. However, from Table 4 we observe a noticeable decrease in the mean and an increase
in the variance. Based on the nonparametric estimators given in Figure 2 we can postulate
that only the location and the scale are affected by time, so that the transformation
$g$ is linear; that is, $g(y) = ay + b$. Similarly the distributions of the variables
$\tilde Y$ and $\tilde X$ can be linked by $\tilde g(y) = \tilde a y + \tilde b$. The functions
$g$ and $\tilde g$ are estimated on the interval $[c, d]$, where
$c = \max(\min_i Y_i, \min_i \tilde Y_i)$ and $d = \min(\max_i Y_i, \max_i \tilde Y_i)$, on
the grid $y_i = c + (d - c)i/M$, for a given $M$.

Figure 1: Kernel estimates of the probability densities of $X$, $Y$, $\tilde X$, $\tilde Y$.
In the top panel: f11 (resp. f21) is the kernel estimate of the density of $Y$ (resp. of $X$).
In the bottom panel: f12 (resp. f22) is the kernel estimate of the density of $\tilde Y$
(resp. of $\tilde X$).

Figure 2: Nonparametric estimators of $g$ and $\tilde g$ and the aggregated estimator on the
interval $[c, d]$: gh (resp. gth and g0h) denotes $\hat g$ (resp. $\hat{\tilde g}$ and
$\hat g_0$).

By applying our test we obtain a p-value very close to 1, and hence we can consider
that $g = \tilde g$.

In Figure 2 we observe that the estimators $\hat g$, $\hat{\tilde g}$ and $\hat g_0$ are all
approximately linear on the interval $[c, d]$; near the boundaries $c$ and $d$, however, the
approximation is poor: the estimators are constant on regions where there are not enough
observations. Therefore, to compute the linear approximation of these estimators we
consider only the $y_i$ belonging to the interval [100, 200].

Ordinary least squares based on $(y_i, \hat g(y_i))$, $(y_i, \hat{\tilde g}(y_i))$ and
$(y_i, \hat g_0(y_i))$, $y_i \in [100, 200]$, $1 \le i \le 50$, yields

$$\hat g(y) = 0.9877y + 0.7035, \quad \hat{\tilde g}(y) = 0.9857y + 0.8335 \quad \text{and} \quad \hat g_0(y) = 0.9867y + 0.7685.$$

By using a parametric approach, i.e. $\hat g_p(y) = ay + b$, where
$a = \mathrm{cov}(X, Y)/\mathrm{var}(Y)$ and $b = \bar X - a\bar Y$, we obtain the following
estimators

$$\hat g_p(y) = 0.760y + 33.075, \qquad \hat{\tilde g}_p(y) = 0.726y + 36.730,$$

and the common aggregate parametric estimator is given by

$$\hat g_{0,p}(y) = 0.744y + 34.787.$$
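The parametric fit above is the usual least-squares line of $X$ on $Y$. A short sketch of the computation is given below; it assumes paired observations $(X_i, Y_i)$ on the same individuals, and the function name is hypothetical.

```python
import numpy as np

def linear_link(x, y):
    """Slope cov(X, Y)/var(Y) and intercept mean(X) - slope * mean(Y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope = np.cov(x, y, ddof=1)[0, 1] / np.var(y, ddof=1)
    intercept = x.mean() - slope * y.mean()
    return slope, intercept

# Synthetic check: data lying exactly on the line x = 3y + 2.
y_demo = np.arange(10.0)
x_demo = 3.0 * y_demo + 2.0
slope_demo, intercept_demo = linear_link(x_demo, y_demo)
print(slope_demo, intercept_demo)
```

Using the same `ddof` in the covariance and the variance makes the ratio independent of the normalisation convention.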

To compare the parametric and the nonparametric approaches, we consider the aggregate
estimators and we compare the predicted values of the first two moments of $X$ and
$\tilde X$ with those observed. The predictions for $X$ (resp. $\tilde X$) are computed by
using the observed moments of $Y$ (resp. $\tilde Y$) and the common transformation. Using the
parametric approach we get

$$\hat m_X = 0.744\,m_Y + 34.787 = 133.590,$$
$$\widehat{\mathrm{Var}}_X = (0.744)^2\,\mathrm{Var}(Y) = 232.145.$$

The nonparametric approach yields

$$\hat m_X = 0.9867\,m_Y + 0.7685 = 131.8,$$
$$\widehat{\mathrm{Var}}_X = (0.9867)^2\,\mathrm{Var}(Y) = 408.04.$$

Note that the observed first two moments of $X$ are 131.2 and 439.11.
Similarly for the pair $(\tilde X, \tilde Y)$, the parametric predictions are given by

$$\hat m_{\tilde X} = 0.744\,m_{\tilde Y} + 34.787 = 131.656,$$
$$\widehat{\mathrm{Var}}_{\tilde X} = (0.744)^2\,\mathrm{Var}(\tilde Y) = 226.933.$$

The nonparametric approach yields

$$\hat m_{\tilde X} = 0.9867\,m_{\tilde Y} + 0.7685 = 129.237,$$
$$\widehat{\mathrm{Var}}_{\tilde X} = (0.9867)^2\,\mathrm{Var}(\tilde Y) = 399.137.$$

Recall that the observed first two moments of $\tilde X$ are 128.8 and 410.21.
The predictions of the nonparametric model are closer to the observed values;
consequently the nonparametric approach seems to be more suitable.
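The predicted moments follow from the elementary rules for a linear map, $E(aY + b) = a\,E(Y) + b$ and $\mathrm{Var}(aY + b) = a^2\,\mathrm{Var}(Y)$. The sketch below redoes this arithmetic from the rounded coefficients and the summary statistics of Table 4; because the inputs are rounded, the results may differ from the reported values in the last digits.

```python
def predicted_moments(slope, intercept, mean_y, var_y):
    """Mean and variance of slope * Y + intercept."""
    return slope * mean_y + intercept, slope**2 * var_y

# Table 4 summaries: (mean, variance) of the pre-transformation variables
# and the observed (mean, variance) of their transformed counterparts.
cases = {
    "X": {"y": (132.8, 419.12), "observed": (131.2, 439.11)},
    "X~": {"y": (130.2, 409.97), "observed": (128.8, 410.21)},
}
fits = {"parametric": (0.744, 34.787), "nonparametric": (0.9867, 0.7685)}

for name, case in cases.items():
    for label, (slope, intercept) in fits.items():
        mean_pred, var_pred = predicted_moments(slope, intercept, *case["y"])
        print(name, label, round(mean_pred, 3), round(var_pred, 3),
              "observed:", case["observed"])
```

Comparing the printed predictions with the observed pairs shows that the slope close to one keeps the predicted variance near the observed one, whereas the smaller parametric slope shrinks it.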
Table 1: Empirical levels of $T_1$ and $T_2$ (in %) for a theoretical level $\alpha = 5\%$.

          n = 50    n = 100   n = 500
$T_1$     3.9       4.75      5.49
$T_2$     4.68      5.52      5.42

Table 2: Empirical powers of $T_1$ and $T_2$ (in %) for a theoretical level $\alpha = 5\%$.

           $T_1$     $T_2$     $T_1$     $T_2$
           $g_1$     $g_1$     $g_2$     $g_2$
n = 50     99.98     99.58     99.81     98.17
n = 100    99.99     99.66     99.91     98.17
n = 500    100       99.69     99.96     98.47

           $T_1$     $T_2$     $T_1$     $T_2$
           $g_3$     $g_3$     $g_4$     $g_4$
n = 50     100       100       78.59     71.47
n = 100    100       100       84.31     78.41
n = 500    100       100       92.42     92.07

Table 3: Empirical powers of $T_1$ and $T_2$ (in %) for a theoretical level $\alpha = 5\%$ under
the local alternative $g_5$.

           $T_1$           $T_2$           $T_1$           $T_2$           $T_1$         $T_2$
           $\beta = 1/4$   $\beta = 1/4$   $\beta = 1/2$   $\beta = 1/2$   $\beta = 4$   $\beta = 4$
n = 50     99.85           97.06           99.64           96.90           4.19          4.77
n = 100    99.85           97.30           99.71           97.02           4.77          5.72
n = 500    99.94           97.85           99.82           97.29           5.36          5.28
Table 4: Descriptive statistics of the Framingham data.

              Min.   1st Qu.  Median  Mean    3rd Qu.  Max.    Var.     Skewness  Kurtosis  KS
$Y$           80.0   120.0    130.0   132.8   142.0    230.0   419.12   1.27      7.79      0.0119
$X$           88.0   118.0    128.0   131.2   142.0    260.0   439.11   1.39      6.65      0.1125
$\tilde Y$    75.0   118.0    128.0   130.2   140.0    270.0   409.97   1.4…      7.25      0.1171
$\tilde X$    85.0   115.0    125.0   128.8   138.0    270.0   410.21   1.47      7.10      0.1117